Abstract

Pixel-level labelling tasks, such as semantic segmentation, play a
central role in image understanding. Recent approaches have attempted
to harness the capabilities of deep learning techniques for image
recognition to tackle pixel-level labelling tasks. One central issue in
this methodology is the limited capacity of deep learning techniques to
delineate visual objects. To solve this problem, we introduce a new
form of convolutional neural network that combines the strengths of
Convolutional Neural Networks (CNNs) and Conditional Random Fields
(CRFs)-based probabilistic graphical modelling. To this end, we
formulate mean-field approximate inference for the Conditional Random
Fields with Gaussian pairwise potentials as Recurrent Neural Networks.
This network, called CRF-RNN, is then plugged in as a part of a CNN to
obtain a deep network that has desirable properties of both CNNs and
CRFs. Importantly, our system fully integrates CRF modelling with CNNs,
making it possible to train the whole deep network end-to-end with the
usual back-propagation algorithm, avoiding offline post-processing
methods for object delineation.

We apply the proposed method to the problem of semantic image
segmentation, obtaining top results on the challenging Pascal VOC 2012
segmentation benchmark.

1. Introduction 

Low-level computer vision problems such as semantic image segmentation
or depth estimation often involve assigning a label to each pixel in an
image. While the feature representation used to classify individual
pixels plays an important role in this task, it is similarly important
to consider factors such as image edges, appearance consistency and
spatial consistency while assigning labels in order to obtain accurate
and precise results.

Designing a strong feature representation is a key challenge in
pixel-level labelling problems. Work on this topic includes:
TextonBoost [52], TextonForest [51], and Random Forest-based
classifiers [50]. Recently, supervised deep learning approaches such as
large-scale deep Convolutional Neural Networks (CNNs) have been
immensely successful in many high-level computer vision tasks such as
image recognition [31] and object detection [20]. This motivates
exploring the use of CNNs for pixel-level labelling problems. The key
insight is to learn a strong feature representation end-to-end for the
pixel-level labelling task instead of hand-crafting features with
heuristic parameter tuning. In fact, a number of recent approaches
including the particularly interesting works FCN [37] and DeepLab [10]
have shown a significant accuracy boost by adapting state-of-the-art
CNN based image classifiers to the semantic segmentation problem.

However, there are significant challenges in adapting CNNs designed for
high-level computer vision tasks such as object recognition to
pixel-level labelling tasks. Firstly, traditional CNNs have
convolutional filters with large receptive fields and hence produce
coarse outputs when restructured to produce pixel-level labels [37].
The presence of max-pooling layers in CNNs further reduces the chance
of getting a fine segmentation output [10]. This, for instance, can
result in non-sharp boundaries and blob-like shapes in semantic
segmentation tasks. Secondly, CNNs lack smoothness constraints that
encourage label agreement between similar pixels, and spatial and
appearance consistency of the labelling output. Lack of such smoothness
constraints can result in poor object delineation and small spurious
regions in the segmentation output [59, 58, 32, 39].

On a separate track to the progress of deep learning techniques,
probabilistic graphical models have been developed as effective methods
to enhance the accuracy of pixel-level labelling tasks. In particular,
Markov Random Fields (MRFs) and their variant Conditional Random Fields
(CRFs) have seen widespread success in this area [32, 29] and have
become one of the most successful graphical models used in computer
vision. The key idea of CRF inference for semantic labelling is to
formulate the label assignment problem as a probabilistic inference
problem that incorporates assumptions such as the label agreement
between similar pixels. CRF inference is able to refine weak and coarse
pixel-level label predictions to produce sharp boundaries and
fine-grained segmentations. Therefore, intuitively, CRFs can be used to
overcome the drawbacks in utilizing CNNs for pixel-level labelling
tasks.

One way to utilize CRFs to improve the semantic labelling results
produced by a CNN is to apply CRF inference as a post-processing step
disconnected from the training of the CNN [10]. Arguably, this does not
fully harness the strength of CRFs since it is not integrated with the
deep network. In this setup, the deep network is unaware of the CRF
during the training phase.

In this paper, we propose an end-to-end deep learning solution for the
pixel-level semantic image segmentation problem. Our formulation
combines the strengths of both CNNs and CRF based graphical models in
one unified framework. More specifically, we formulate mean-field
approximate inference for the dense CRF with Gaussian pairwise
potentials as a Recurrent Neural Network (RNN) which can refine coarse
outputs from a traditional CNN in the forward pass, while passing error
differentials back to the CNN during training. Importantly, with our
formulation, the whole deep network, which comprises a traditional CNN
and an RNN for CRF inference, can be trained end-to-end utilizing the
usual back-propagation algorithm.

Arguably, when properly trained, the proposed network should outperform
a system where CRF inference is applied as a post-processing method on
independent pixel-level predictions produced by a pre-trained CNN. Our
experimental evaluation confirms that this indeed is the case. We
evaluate the performance of our network on the popular Pascal VOC 2012
benchmark, achieving a new state-of-the-art accuracy of 74.7%. Our
source code and models are publicly available.

2. Related Work 

In this section we review approaches that make use of deep learning and
CNNs for low-level computer vision tasks, with a focus on semantic
image segmentation. A wide variety of approaches have been proposed to
tackle the semantic image segmentation task using deep learning. These
approaches can be categorized into two main strategies.

The first strategy is based on utilizing separate mechanisms for
feature extraction, and image segmentation exploiting the edges of the
image [2, 38]. One representative instance of this scheme is the
application of a CNN for the extraction of meaningful features, and
using superpixels to account for the structural pattern of the image.
Two representative examples are [19, 38], where the authors first
obtained superpixels from the image and then used a feature extraction
process on each of them. The main disadvantage of this strategy is that
errors in the initial proposals (e.g., superpixels) may lead to poor
predictions, no matter how good the feature extraction process is.
Pinheiro and Collobert [46] employed an RNN to model the spatial
dependencies during scene parsing. In contrast to their approach, we
show that a typical graphical model such as a CRF can be formulated as
an RNN to form a part of a deep network, to perform end-to-end training
combined with a CNN.

The second strategy is to directly learn a nonlinear model from the
images to the label map. This, for example, was shown in [17], where
the authors replaced the last fully connected layers of a CNN by
convolutional layers to keep spatial information. An important
contribution in this direction is [37], where Long et al. used the
concept of fully convolutional networks, and the notion that top layers
obtain meaningful features for object recognition whereas low layers
keep information about the structure of the image, such as edges. In
their work, connections from early layers to later layers were used to
combine these cues. Bell et al. [5] and Chen et al. [10, 41] used a CRF
to refine segmentation results obtained from a CNN. Bell et al. focused
on material recognition and segmentation, whereas Chen et al. reported
very significant improvements on semantic image segmentation. In
contrast to these works, which employed CRF inference as a standalone
post-processing step disconnected from the CNN training, our approach
is an end-to-end trainable network that jointly learns the parameters
of the CNN and the CRF in one unified deep network.

Works that use neural networks to predict structured output are found
in different domains. For example, Do et al. [14] proposed an approach
to combine deep neural networks and Markov networks for sequence
labeling tasks. Jain et al. [26] showed that Convolutional Neural
Networks can perform as well as MRF/CRF approaches in image restoration
applications. Another domain which benefits from the combination of
CNNs and structured loss is handwriting recognition. In natural
language processing, Yao et al. [60] showed that the performance of an
RNN-based word tagger can be significantly improved by incorporating
elements of the CRF model. In [6], the authors combined a CNN with
Hidden Markov Models for handwriting recognition, whereas more
recently, Peng et al. [45] used a modified version of CRFs. Related to
this line of works, in [25] a joint CNN and CRF model was used for text
recognition on natural images. Tompson et al. [57] showed the use of
joint training of a CNN and an MRF for human pose estimation, while
Chen et al. [11] focused on the image classification problem with a
similar approach. Another prominent work is [21], in which the authors
express deformable part models, a kind of MRF, as a layer in a neural
network. In our approach, we cast a different graphical model as a
neural network layer.

A number of approaches have been proposed for automatic learning of
graphical model parameters and joint training of classifiers and
graphical models. Barbu et al. [4] proposed a joint training of an
MRF/CRF model together with an inference algorithm in their Active
Random Field approach. Domke [15] advocated back-propagation based
parameter optimization in graphical models when approximate inference
methods such as mean-field and belief propagation are used. This idea
was utilized in [28], where a binary dense CRF was used for human pose
estimation. Similarly, Ross et al. [47] and Stoyanov et al. [54] showed
how back-propagation through belief propagation can be used to optimize
model parameters. Ross et al. [21], in particular, propose an approach
based on learning messages. Many of these ideas can be traced back to
[55], which proposes unrolling message passing algorithms as simpler
operations that could be performed within a CNN. In a different setup,
Krähenbühl and Koltun [30] demonstrated automatic parameter tuning of
dense CRF when a modified mean-field algorithm is used for inference.
An alternative inference approach for dense CRF, not based on
mean-field, is proposed in [61].

In contrast to the works described above, our approach shows that it is
possible to formulate dense CRF as an RNN so that one can form an
end-to-end trainable system for semantic image segmentation which
combines the strengths of deep learning and graphical modelling.

After our initial publication of the technical report of this work on
arXiv.org, a number of independent works [49, 35] appeared on arXiv.org
presenting similar joint training approaches for semantic image
segmentation.

3. Conditional Random Fields 

In this section we provide a brief overview of Conditional Random
Fields (CRF) for pixel-wise labelling and introduce the notation used
in the paper. A CRF, used in the context of pixel-wise label
prediction, models pixel labels as random variables that form a Markov
Random Field (MRF) when conditioned upon a global observation. The
global observation is usually taken to be the image.

Let $X_i$ be the random variable associated to pixel $i$, which
represents the label assigned to the pixel $i$ and can take any value
from a pre-defined set of labels $\mathcal{L} = \{l_1, l_2, \ldots,
l_L\}$. Let $\mathbf{X}$ be the vector formed by the random variables
$X_1, X_2, \ldots, X_N$, where $N$ is the number of pixels in the
image. Given a graph $G = (V, E)$, where $V = \{X_1, X_2, \ldots,
X_N\}$, and a global observation (image) $\mathbf{I}$, the pair
$(\mathbf{I}, \mathbf{X})$ can be modelled as a CRF characterized by a
Gibbs distribution of the form $P(\mathbf{X} = \mathbf{x} \mid
\mathbf{I}) = \frac{1}{Z(\mathbf{I})} \exp(-E(\mathbf{x} \mid
\mathbf{I}))$. Here $E(\mathbf{x})$ is called the energy of the
configuration $\mathbf{x} \in \mathcal{L}^N$ and $Z(\mathbf{I})$ is the
partition function [33]. From now on, we drop the conditioning on
$\mathbf{I}$ in the notation for convenience.

Figure 1. A mean-field iteration as a CNN. A single iteration of the
mean-field algorithm can be modelled as a stack of common CNN layers.

In the fully connected pairwise CRF model of [29], the energy of a
label assignment $\mathbf{x}$ is given by:

$$E(\mathbf{x}) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j), \tag{1}$$

where the unary energy components $\psi_u(x_i)$ measure the inverse
likelihood (and therefore, the cost) of the pixel $i$ taking the label
$x_i$, and pairwise energy components $\psi_p(x_i, x_j)$ measure the
cost of assigning labels $x_i, x_j$ to pixels $i, j$ simultaneously. In
our model, unary energies are obtained from a CNN, which, roughly
speaking, predicts labels for pixels without considering the smoothness
and the consistency of the label assignments. The pairwise energies
provide an image data-dependent smoothing term that encourages
assigning similar labels to pixels with similar properties. As was done
in [29], we model pairwise potentials as weighted Gaussians:

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{M} w^{(m)} k_G^{(m)}(\mathbf{f}_i, \mathbf{f}_j), \tag{2}$$

where each $k_G^{(m)}$, for $m = 1, \ldots, M$, is a Gaussian kernel
applied on feature vectors. The feature vector of pixel $i$, denoted by
$\mathbf{f}_i$, is derived from image features such as spatial location
and RGB values [29]. We use the same features as in [29]. The function
$\mu(\cdot, \cdot)$, called the label compatibility function, captures
the compatibility between different pairs of labels as the name
implies.
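To make Eqs. (1) and (2) concrete, the following is a minimal brute-force sketch in NumPy that evaluates the energy of a labelling on a tiny image. It assumes two kernels (a bilateral kernel on position and colour, and a spatial kernel on position only), a Potts compatibility function, and illustrative bandwidths `theta_alpha`, `theta_beta`, `theta_gamma`; all of these names and values are hypothetical choices for illustration, not the settings used in the paper.

```python
import numpy as np

def gaussian_kernels(image, theta_alpha=3.0, theta_beta=0.1, theta_gamma=3.0):
    """Return (bilateral, spatial) N x N kernel matrices for an H x W x 3 image.

    The bandwidth names and values are illustrative assumptions.
    """
    h, w, _ = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pos = np.stack([ys.ravel(), xs.ravel()], axis=1).astype(float)  # N x 2
    rgb = image.reshape(-1, 3).astype(float)                        # N x 3
    d_pos = ((pos[:, None, :] - pos[None, :, :]) ** 2).sum(-1)      # N x N
    d_rgb = ((rgb[:, None, :] - rgb[None, :, :]) ** 2).sum(-1)      # N x N
    bilateral = np.exp(-d_pos / (2 * theta_alpha**2) - d_rgb / (2 * theta_beta**2))
    spatial = np.exp(-d_pos / (2 * theta_gamma**2))
    return bilateral, spatial

def crf_energy(labels, unary_energy, kernels, weights, mu):
    """E(x) = sum_i psi_u(x_i) + sum_{i<j} mu(x_i, x_j) sum_m w_m k_m(f_i, f_j)."""
    n = labels.size
    unary = unary_energy[np.arange(n), labels].sum()
    combined = sum(w * km for w, km in zip(weights, kernels))  # N x N
    pair = mu[labels[:, None], labels[None, :]] * combined     # N x N pairwise costs
    pairwise = np.triu(pair, k=1).sum()                        # keep i < j only
    return unary + pairwise

rng = np.random.default_rng(0)
image = rng.random((2, 3, 3))              # 2 x 3 RGB image, so N = 6
kernels = gaussian_kernels(image)
num_labels = 3
mu_potts = 1.0 - np.eye(num_labels)        # Potts model: cost 1 if labels differ
unary_energy = rng.random((6, num_labels))

labels_uniform = np.zeros(6, dtype=int)
labels_mixed = np.array([0, 1, 0, 1, 0, 1])
energy_uniform = crf_energy(labels_uniform, unary_energy, kernels, [1.0, 1.0], mu_potts)
energy_mixed = crf_energy(labels_mixed, unary_energy, kernels, [1.0, 1.0], mu_potts)
```

With a Potts compatibility, a labelling that assigns the same label everywhere pays no pairwise cost, so its energy reduces to the sum of the unary terms, whereas any disagreement between pixels adds a positive kernel-weighted penalty.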

Minimizing the above CRF energy $E(\mathbf{x})$ yields the most
probable label assignment $\mathbf{x}$ for the given image. Since this
exact minimization is intractable, a mean-field approximation to the
CRF distribution is used for approximate maximum posterior marginal
inference. It consists of approximating the CRF distribution
$P(\mathbf{X})$ by a simpler distribution $Q(\mathbf{X})$, which can be
written as the product of independent marginal distributions, i.e.,
$Q(\mathbf{X}) = \prod_i Q_i(X_i)$. The steps of the iterative
algorithm for approximate mean-field inference and its reformulation as
an RNN are discussed next.

Algorithm 1 Mean-field in dense CRFs [29], broken down to common CNN
operations.

  $Q_i(l) \leftarrow \frac{1}{Z_i} \exp(U_i(l))$ for all $i$    > Initialization
  while not converged do
    $\tilde{Q}_i^{(m)}(l) \leftarrow \sum_{j \neq i} k_G^{(m)}(\mathbf{f}_i, \mathbf{f}_j) Q_j(l)$ for all $m$    > Message Passing
    $\check{Q}_i(l) \leftarrow \sum_m w^{(m)} \tilde{Q}_i^{(m)}(l)$    > Weighting Filter Outputs
    $\hat{Q}_i(l) \leftarrow \sum_{l' \in \mathcal{L}} \mu(l, l') \check{Q}_i(l')$    > Compatibility Transform
    $\breve{Q}_i(l) \leftarrow U_i(l) - \hat{Q}_i(l)$    > Adding Unary Potentials
    $Q_i \leftarrow \frac{1}{Z_i} \exp(\breve{Q}_i(l))$    > Normalizing
  end while


4. A Mean-field Iteration as a Stack of CNN Layers

A key contribution of this paper is to show that the mean-field CRF
inference can be reformulated as a Recurrent Neural Network (RNN). To
this end, we first consider individual steps of the mean-field
algorithm summarized in Algorithm 1 [29], and describe them as CNN
layers. Our contribution is based on the observation that the
filter-based approximate mean-field inference approach for dense CRFs
relies on applying Gaussian spatial and bilateral filters on the
mean-field approximates in each iteration. Unlike the standard
convolutional layer in a CNN, in which filters are fixed after the
training stage, we use edge-preserving Gaussian filters [56, 42],
coefficients of which depend on the original spatial and appearance
information of the image. These filters have the additional advantage
of requiring a smaller set of parameters, despite the filter size being
potentially as big as the image.

While reformulating the steps of the inference algorithm as CNN layers,
it is essential to be able to calculate error differentials in each
layer w.r.t. its inputs in order to be able to back-propagate the error
differentials to previous layers during training. We also discuss how
to calculate error differentials with respect to the parameters in each
layer, enabling their optimization through the back-propagation
algorithm. Therefore, in our formulation, CRF parameters such as the
weights of the Gaussian kernels and the label compatibility function
can also be optimized automatically during the training of the full
network.

Once the individual steps of the algorithm are broken down as CNN
layers, the full algorithm can then be formulated as an RNN. We explain
this in Section 5 after discussing the steps of Algorithm 1 in detail
below. In Algorithm 1 and the remainder of this paper, we use $U_i(l)$
to denote the negative of the unary energy introduced in the previous
section, i.e., $U_i(l) = -\psi_u(X_i = l)$. In the conventional CRF
setting, this input $U_i(l)$ to the mean-field algorithm is obtained
from an independent classifier.

4.1. Initialization 

In the initialization step of the algorithm, the operation
$Q_i(l) \leftarrow \frac{1}{Z_i} \exp(U_i(l))$, where
$Z_i = \sum_l \exp(U_i(l))$, is performed. Note that this is equivalent
to applying a softmax function over the unary potentials U across all
the labels at each pixel. The softmax function has been extensively
used in CNN architectures before and is therefore well known in the
deep learning community. This operation does not include any parameters
and the error differentials received at the output of the step during
back-propagation could be passed down to the unary potential inputs
after performing usual backward pass calculations of the softmax
transformation.
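As an illustration of this step, here is a small NumPy sketch of the per-pixel softmax and its backward pass. The array layout (one row per pixel, one column per label) and the function names are assumptions made for this example.

```python
import numpy as np

def softmax(u):
    """Softmax over the last (label) axis of an N x L array."""
    z = u - u.max(axis=-1, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def softmax_backward(q, grad_q):
    """Given q = softmax(u) and dLoss/dq, return dLoss/du.

    Uses dLoss/du_k = q_k * (g_k - sum_j g_j q_j) at every pixel.
    """
    inner = (grad_q * q).sum(axis=-1, keepdims=True)
    return q * (grad_q - inner)

rng = np.random.default_rng(0)
unary = rng.normal(size=(4, 3))       # 4 pixels, 3 labels
q0 = softmax(unary)                   # initial marginals Q_i(l)
grad_q = rng.normal(size=q0.shape)    # stand-in for the gradient from later layers
grad_u = softmax_backward(q0, grad_q)
```

No parameters are involved, so the backward pass only has to map the incoming gradient on Q to a gradient on the unary inputs.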

4.2. Message Passing 

In the dense CRF formulation, message passing is implemented by
applying M Gaussian filters on Q values. Gaussian filter coefficients
are derived based on image features such as the pixel locations and RGB
values, which reflect how strongly a pixel is related to other pixels.
Since the CRF is potentially fully connected, each filter's receptive
field spans the whole image, making it infeasible to use a brute-force
implementation of the filters. Fortunately, several approximation
techniques exist to make computation of high dimensional Gaussian
filtering significantly faster. Following [29], we use the
Permutohedral lattice implementation [1], which can compute the filter
response in $O(N)$ time, where $N$ is the number of pixels of the
image [1].

During back-propagation, error derivatives w.r.t. the filter inputs are
calculated by sending the error derivatives w.r.t. the filter outputs
through the same M Gaussian filters in reverse direction. In terms of
permutohedral lattice operations, this can be accomplished by only
reversing the order of the separable filters in the blur stage, while
building the permutohedral lattice, splatting, and slicing in the same
way as in the forward pass. Therefore, back-propagation through this
filtering stage can also be performed in $O(N)$ time. Following [29],
we use two Gaussian kernels, a spatial kernel and a bilateral kernel.
In this work, for simplicity, we keep the bandwidth values of the
filters fixed. It is also possible to use multiple spatial and
bilateral kernels with different bandwidth values and learn their
optimal linear combination.
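The following sketch illustrates message passing and its backward pass with explicit N x N kernel matrices. It is a deliberately naive O(N^2) stand-in for the permutohedral lattice filtering described above, suitable only for tiny inputs; the random feature vectors, bandwidths, and function names are hypothetical.

```python
import numpy as np

def message_passing(q, kernels):
    """Apply M dense kernels to Q. q: N x L, kernels: M x N x N -> M x N x L."""
    return np.einsum("mij,jl->mil", kernels, q)

def message_passing_backward(grad_out, kernels):
    """Map dLoss/d(filter outputs) (M x N x L) to dLoss/dQ (N x L).

    This applies the transposed kernels, i.e. the same filters in reverse.
    """
    return np.einsum("mij,mil->jl", kernels, grad_out)

rng = np.random.default_rng(0)
n, num_labels = 5, 3
feats = rng.normal(size=(n, 2))                        # toy feature vectors f_i
d2 = ((feats[:, None] - feats[None]) ** 2).sum(-1)     # squared distances, N x N
kernels = np.stack([np.exp(-d2 / (2 * s**2)) for s in (0.5, 2.0)])  # M = 2
kernels[:, np.arange(n), np.arange(n)] = 0.0           # messages exclude j = i

q = rng.random((n, num_labels))
q /= q.sum(axis=1, keepdims=True)                      # rows are distributions
filtered = message_passing(q, kernels)
grad_out = rng.normal(size=filtered.shape)             # stand-in upstream gradient
grad_q = message_passing_backward(grad_out, kernels)
```

Because Gaussian kernels are symmetric, the transposed filters coincide with the forward filters, which is why the backward pass can reuse the same filtering machinery.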

4.3. Weighting Filter Outputs 

The next step of the mean-field iteration is taking a weighted sum of
the M filter outputs from the previous step, for each class label $l$.
When each class label is considered individually, this can be viewed as
usual convolution with a 1 × 1 filter with M input channels, and one
output channel. Since both inputs and the outputs to this step are
known during back-propagation, the error derivative w.r.t. the filter
weights can be computed, making it possible to automatically learn the
filter weights (relative contributions from each Gaussian filter output
from the previous stage). Error derivative w.r.t. the inputs can also
be computed in the usual manner to pass the error derivatives down to
the previous stage. To obtain a higher number of tunable parameters, in
contrast to [29], we use independent kernel weights for each class
label. The intuition is that the relative importance of the spatial
kernel vs the bilateral kernel depends on the visual class. For
example, bilateral kernels may be highly important for bicycle
detection, because colour similarity is decisive, but less important
for TV detection, given that whatever is inside the TV screen may have
many different colours.
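A minimal sketch of this weighting step with per-class kernel weights is given below. The tensor layout (kernel, pixel, label) and the function names are assumptions for illustration.

```python
import numpy as np

def weight_filter_outputs(filtered, weights):
    """filtered: M x N x L, weights: M x L (one weight per kernel and class) -> N x L."""
    return np.einsum("mnl,ml->nl", filtered, weights)

def weight_filter_outputs_backward(filtered, weights, grad_out):
    """Return gradients w.r.t. the filter outputs and w.r.t. the weights."""
    grad_filtered = np.einsum("ml,nl->mnl", weights, grad_out)
    grad_weights = np.einsum("mnl,nl->ml", filtered, grad_out)
    return grad_filtered, grad_weights

rng = np.random.default_rng(0)
num_kernels, n, num_labels = 2, 4, 3
filtered = rng.random((num_kernels, n, num_labels))   # outputs of message passing
weights = rng.random((num_kernels, num_labels))       # independent weights per class
weighted = weight_filter_outputs(filtered, weights)
grad_out = rng.normal(size=weighted.shape)            # stand-in upstream gradient
grad_filtered, grad_weights = weight_filter_outputs_backward(filtered, weights, grad_out)
```

Sharing one weight per kernel across all classes, as in the original dense CRF, would correspond to a `weights` array whose columns are identical.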

4.4. Compatibility Transform 

In the compatibility transform step, outputs from the previous step
(denoted by $\check{Q}$ in Algorithm 1) are shared between the labels
to a varied extent, depending on the compatibility between these
labels. Compatibility between the two labels $l$ and $l'$ is
parameterized by the label compatibility function $\mu(l, l')$. The
Potts model, given by $\mu(l, l') = [l \neq l']$, where $[\cdot]$ is
the Iverson bracket, assigns a fixed penalty if different labels are
assigned to pixels with similar properties. A limitation of this model
is that it assigns the same penalty for all different pairs of labels.
Intuitively, better results can be obtained by taking the compatibility
between different label pairs into account and penalizing the
assignments accordingly. For example, assigning labels “person” and
“bicycle” to nearby pixels should have a lesser penalty than assigning
labels “sky” and “bicycle”. Therefore, learning the function $\mu$ from
data is preferred to fixing it in advance with the Potts model. We also
relax our compatibility transform model by assuming that
$\mu(l, l') \neq \mu(l', l)$ in general.

The compatibility transform step can be viewed as another convolution
layer where the spatial receptive field of the filter is 1 × 1, and the
number of input and output channels are both L. Learning the weights of
this filter is equivalent to learning the label compatibility function
$\mu$. Transferring error differentials from the output of this step to
the input can be done since this step is a usual convolution operation.
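Since a 1 × 1 convolution over L channels amounts to a matrix product at each pixel, the compatibility transform can be sketched as follows. The Potts initialisation and the asymmetric example matrix are illustrative assumptions.

```python
import numpy as np

def potts_compatibility(num_labels):
    """mu(l, l') = [l != l']: zero on the diagonal, one elsewhere."""
    return 1.0 - np.eye(num_labels)

def compatibility_transform(q_weighted, mu):
    """q_weighted: N x L, mu: L x L; out[i, l] = sum_{l'} mu[l, l'] * q_weighted[i, l']."""
    return q_weighted @ mu.T

rng = np.random.default_rng(0)
q = rng.random((4, 3))
q /= q.sum(axis=1, keepdims=True)        # normalised only to make the check easy

out_potts = compatibility_transform(q, potts_compatibility(3))

# A learned mu need not be symmetric: here label 0 penalises label 2 more
# strongly than label 2 penalises label 0 (hypothetical values).
mu_asym = np.array([[0.0, 1.0, 2.0],
                    [1.0, 0.0, 1.0],
                    [0.5, 1.0, 0.0]])
out_asym = compatibility_transform(q, mu_asym)
```

With the Potts matrix, each label simply receives the total mass assigned to all other labels, so every label pair is penalised equally; a learned matrix removes that restriction.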

4.5. Adding Unary Potentials 

In this step, the output from the compatibility transform stage is
subtracted element-wise from the unary inputs U. While no parameters
are involved in this step, transferring error differentials can be done
trivially by copying the differentials at the output of this step to
both inputs with the appropriate sign.

4.6. Normalization 

Finally, the normalization step of the iteration can be considered as
another softmax operation with no parameters. Differentials at the
output of this step can be passed on to the input using the softmax
operation's backward pass.

5. The End-to-end Trainable Network 

We now describe our end-to-end deep learning system 
for semantic image segmentation. To pave the way for this, 
we first explain how repeated mean-field iterations can be 
organized as an RNN. 

5.1. CRF as RNN 

In the previous section, it was shown that one iteration of the
mean-field algorithm can be formulated as a stack of common CNN layers
(see Fig. 1). We use the function $f_\theta$ to denote the
transformation done by one mean-field iteration: given an image $I$,
pixel-wise unary potential values $U$ and an estimation of marginal
probabilities $Q_{in}$ from the previous iteration, the next estimation
of marginal distributions after one mean-field iteration is given by
$f_\theta(U, Q_{in}, I)$. The vector
$\theta = \{w^{(m)}, \mu(l, l')\}$, $m \in \{1, \ldots, M\}$,
$l, l' \in \{l_1, \ldots, l_L\}$ represents the CRF parameters
described in Section 4.

Multiple mean-field iterations can be implemented by repeating the
above stack of layers in such a way that each iteration takes Q value
estimates from the previous iteration and the unary values in their
original form. This is equivalent to treating the iterative mean-field
inference as a Recurrent Neural Network (RNN) as shown in Fig. 2. Using
the notation in the figure, the behaviour of the network is given by
the following equations where T is the number of mean-field iterations:

$$H_1(t) = \begin{cases} \mathrm{softmax}(U), & t = 0 \\ H_2(t-1), & 0 < t \le T, \end{cases} \tag{3}$$

$$H_2(t) = f_\theta(U, H_1(t), I), \quad 0 \le t \le T, \tag{4}$$

$$Y(t) = \begin{cases} 0, & 0 \le t < T \\ H_2(t), & t = T. \end{cases} \tag{5}$$

Figure 2. The CRF-RNN Network. We formulate the iterative mean-field
algorithm as a Recurrent Neural Network (RNN). Gating functions $G_1$
and $G_2$ are fixed as described in the text.

Figure 3. The End-to-end Trainable Network. Schematic visualization of
our full network which consists of a CNN and the CNN-CRF network. Best
viewed in colour.

We name this RNN structure CRF-RNN. Parameters of the CRF-RNN are the
same as the mean-field parameters described in Section 4 and denoted by
$\theta$ here. Since the calculation of error differentials w.r.t.
these parameters in a single iteration was described in Section 4, they
can be learnt in the RNN setting using the standard back-propagation
through time algorithm [48, 40]. It was shown in [29] that the
mean-field iterative algorithm for dense CRF converges in fewer than 10
iterations. Furthermore, in practice, after about 5 iterations,
increasing the number of iterations usually does not significantly
improve results [29]. Therefore, it does not suffer from the vanishing
and exploding gradient problem inherent to deep RNNs [7, 43]. This
allows us to use a plain RNN architecture instead of more sophisticated
architectures such as LSTMs in our network.
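The recurrence can be sketched as a plain loop around a single mean-field step. The version below is a toy forward pass only, under several assumptions: dense N x N kernels replace the permutohedral lattice, the image enters only through the precomputed `kernels`, the step is applied exactly `num_iterations` times, and all names and values are hypothetical.

```python
import numpy as np

def softmax(u):
    e = np.exp(u - u.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_field_step(unary, q_in, kernels, weights, mu):
    """One mean-field iteration f_theta(U, Q_in, I)."""
    filtered = np.einsum("mij,jl->mil", kernels, q_in)     # message passing
    weighted = np.einsum("mil,ml->il", filtered, weights)  # weighting filter outputs
    pairwise = weighted @ mu.T                             # compatibility transform
    return softmax(unary - pairwise)                       # add unaries, normalize

def crf_rnn_forward(unary, kernels, weights, mu, num_iterations=5):
    h1 = softmax(unary)                  # H1 at t = 0
    for _ in range(num_iterations):
        h2 = mean_field_step(unary, h1, kernels, weights, mu)
        h1 = h2                          # gate G1 feeds H2 back as the next H1
    return h1                            # gate G2 releases the output only at the end

# Toy 1-D "image": five pixels in a row, spatial kernel only (M = 1).
pos = np.arange(5, dtype=float)
spatial = np.exp(-((pos[:, None] - pos[None, :]) ** 2) / 2.0)
np.fill_diagonal(spatial, 0.0)           # messages exclude j = i
kernels = spatial[None]                  # shape 1 x 5 x 5
weights = np.ones((1, 2))                # one kernel, two labels
mu = 1.0 - np.eye(2)                     # Potts compatibility

# The unaries prefer label 0 everywhere except the middle pixel.
unary = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 0.5], [1.0, 0.0], [1.0, 0.0]])
q_final = crf_rnn_forward(unary, kernels, weights, mu, num_iterations=5)
```

In this toy setting the middle pixel disagrees with all of its neighbours, so the pairwise term is expected to pull it towards the majority label, which is the kind of spurious-region clean-up that motivates the CRF stage.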

5.2. Completing the Picture 

Our approach comprises a fully convolutional network stage, which
predicts pixel-level labels without considering structure, followed by
a CRF-RNN stage, which performs CRF-based probabilistic graphical
modelling for structured prediction. The complete system, therefore,
unifies strengths of both CNNs and CRFs and is trainable end-to-end
using the back-propagation algorithm [34] and the Stochastic Gradient
Descent (SGD) procedure. During training, a whole image (or many of
them) can be used as the mini-batch and the error at each pixel output
of the network can be computed using an appropriate loss function such
as the softmax loss with respect to the ground truth segmentation of
the image. We used the FCN-8s architecture of [37] as the first part of
our network, which provides unary potentials to the CRF. This network
is based on the VGG-16 network [53] but has been restructured to
perform pixel-wise prediction instead of image classification. The
complete architecture of our network, including the FCN-8s part, can be
found in the appendix.

In the forward pass through the network, once the computation enters
the CRF-RNN after passing through the CNN stage, it takes T iterations
for the data to leave the loop created by the RNN. Neither the CNN that
provides unary values nor the layers after the CRF-RNN (i.e., the loss
layers) need to perform any computations during this time since the
refinement happens only inside the RNN's loop. Once the output Y leaves
the loop, next stages of the deep network after the CRF-RNN can
continue the forward pass. In our setup, a softmax loss layer directly
follows the CRF-RNN and terminates the network.

During the backward pass, once the error differentials reach the
CRF-RNN's output Y, they similarly spend T iterations within the loop
before reaching the RNN input U in order to propagate to the CNN which
provides the unary input. In each iteration inside the loop, error
differentials are computed inside each component of the mean-field
iteration as described in Section 4. We note that unnecessarily
increasing the number of mean-field iterations T could potentially
result in the vanishing and exploding gradient problems in the CRF-RNN.
We, however, did not experience this problem during our experiments.

6. Implementation Details 

In the present section we describe the implementation details of the
proposed network, as well as its training process. The high-level
architecture of our system, which was implemented using the popular
Caffe [27] deep learning library, is shown in Fig. 3. Complete
architecture of the deep network can be found in the appendix. The full
source code and the trained models of our approach will be made
publicly available.

We initialized the first part of the network using the publicly
available weights of the FCN-8s network [37]. The compatibility
transform parameters of the CRF-RNN were initialized using the Potts
model, and kernel width and weight parameters were obtained from a
cross-validation process. We found that such initialization results in
faster convergence of training. During the training phase, parameters
of the whole network were optimized end-to-end using the
back-propagation algorithm. In particular we used full image training
described in [37], with learning rate fixed at $10^{-13}$ and momentum
set to 0.99. These extreme values of the parameters were used since we
employed only one image per batch to avoid reaching memory limits of
the GPU.

In all our experiments, during training, we set the number of
mean-field iterations T in the CRF-RNN to 5 to avoid
vanishing/exploding gradient problems and to reduce the training time.
During the test time, iteration count was increased to 10. The effect
of this parameter value on the accuracy is discussed in Section 7.1.

Loss function. During the training of the models that achieved the best
results reported in this paper, we used the standard softmax loss
function, that is, the log-likelihood error function described in [30].
The standard metric used in the Pascal VOC challenge is the average
intersection over union (IU), which we also use here to report the
results. In our experiments we found that high values of IU on the
validation set were associated with low values of the averaged softmax
loss, to a large extent. We also tried the robust log-likelihood in
[30] as a loss function for CRF-RNN training. However, this resulted in
neither increased accuracy nor faster convergence.
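For reference, the average IU metric can be computed from a confusion matrix as in the sketch below. This is a generic illustration rather than the official Pascal VOC evaluation code; in particular, it does not handle the "void" pixels that the benchmark ignores, and the function names are hypothetical.

```python
import numpy as np

def confusion_matrix(gt, pred, num_classes):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    return np.bincount(idx, minlength=num_classes**2).reshape(num_classes, num_classes)

def mean_iu(conf):
    """Average over classes of TP / (TP + FP + FN), skipping absent classes."""
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()

gt = np.array([[0, 0], [1, 1]])      # ground-truth labels of a 2 x 2 image
pred = np.array([[0, 1], [1, 1]])    # predicted labels
conf = confusion_matrix(gt, pred, num_classes=2)
score = mean_iu(conf)                # class 0: 1/2, class 1: 2/3
```

Unlike the per-pixel softmax loss, this score weights every class equally regardless of how many pixels it covers, which is one reason the two quantities agree only "to a large extent".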

Normalization techniques. As described in Section 4,
we use the exponential function followed by pixel-wise
normalization across channels in several stages of the
CRF-RNN. Since this operation tends to produce small
gradients with respect to the input when the input value is
large, we conducted several experiments where we replaced
it by a rectified linear unit (ReLU) operation followed by
a normalization across the channels. Our hypothesis was
that this approach may approximate the original operation
adequately while speeding up the training due to improved
gradients. Furthermore, ReLU would induce sparsity on the
probability of labels assigned to pixels, implicitly pruning
low-likelihood configurations, which could have a positive
effect. However, this approach did not lead to better results:
it obtained an IU 1% lower than that of the original setting.
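The two normalization variants can be compared on a single pixel with a small sketch. The helper names (`softmax`, `relu_normalize`, `softmax_self_gradient`), the `eps` guard, and the input values are assumptions made for this illustration only.

```python
import math


def softmax(x):
    """Exponential followed by normalization across channels (one pixel)."""
    m = max(x)  # subtract the max for numerical stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def relu_normalize(x, eps=1e-12):
    """ReLU followed by normalization across channels (one pixel)."""
    r = [max(v, 0.0) for v in x]
    s = sum(r) + eps  # eps (an assumption) guards against an all-zero vector
    return [v / s for v in r]


def softmax_self_gradient(x):
    """Diagonal of the softmax Jacobian: d p_i / d x_i = p_i * (1 - p_i)."""
    return [p * (1.0 - p) for p in softmax(x)]


moderate = [1.0, 0.5, -0.5]
large = [20.0, 0.5, -0.5]

print(softmax_self_gradient(moderate))  # gradients of usable size
print(softmax_self_gradient(large))     # all close to zero: saturation
print(relu_normalize(moderate))         # the negative channel becomes exactly 0
```

The second call shows the saturation that motivated the experiment, and the third shows the sparsity that ReLU induces on the label probabilities.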

7. Experiments 

We present experimental results with the proposed CRF-RNN
framework. We use two datasets: the Pascal VOC 2012 dataset
and the Pascal Context dataset. We use the Pascal VOC 2012
dataset as it has become the gold standard for comprehensively
evaluating any new semantic segmentation approach in
comparison to existing methods. We also use the Pascal Context
dataset to assess how well our approach performs on a dataset
with different characteristics.

Pascal VOC Datasets 

In order to compare our approach with existing methods under
the same circumstances, we conducted two main experiments
with the Pascal VOC 2012 dataset, followed by a qualitative
experiment.

In the first experiment, following [37, 38, 41], we used
a training set consisting of the VOC 2012 training data (1464
images) and the training and validation data of [23], which
amounts to a total of 11,685 images. After removing the
images that overlap between the VOC 2012 validation data and
this training dataset, we were left with 346 images from the
original VOC 2012 validation set on which to validate our
models. We call this set the reduced validation set in the
sequel. Annotations of the VOC 2012 test set, which consists
of 1456 images, are not publicly available, and hence the
final results on the test set were obtained by submitting the
results to the Pascal VOC challenge evaluation server [18].
Despite the small number of images in the reduced validation
set, we found that the relative improvements in accuracy on
it were in good agreement with those on the test set.

Figure 4. Qualitative results on the validation set of Pascal
VOC 2012. FCN [37] is a CNN-based model that does not employ
CRF. DeepLab [10] is a two-stage approach, where the CNN
is trained first, and then CRF is applied on top of the CNN output.
Our approach is an end-to-end trained system that integrates both
CNN and CRF-RNN in one deep network. Best viewed in colour.

As a first step, we directly compared the potential advantage
of learning the model end-to-end against alternative learning
strategies. These are the plain FCN-8s without a CRF, and
FCN-8s with a CRF applied as a post-processing method
disconnected from the training of the FCN, which is comparable
to the approach described in [10] and [41]. The results
are reported in Table 1 and show a clear advantage of the
end-to-end strategy over the offline application of CRF as a
post-processing method. This can be attributed to the fact
that during the SGD training of the CRF-RNN, the CNN
component and the CRF component learn how to co-operate
with each other to produce the optimum output of the whole
network.

We then proceeded to compare our approach with all 
state-of-the-art methods that used training data from the 
standard VOC 2012 training and validation sets, and from 
the dataset published with [22]. The results are shown in 
Table 2, above the bar, and we can see that our approach 
outperforms all competitors. 

In the second experiment, in addition to the above training
set, we used data from the Microsoft COCO dataset [36],
as was done in [41] and [12]. We selected images from the
MS COCO 2014 training set where the ground truth
segmentation has at least 200 pixels marked with class labels
present in the VOC 2012 dataset. With this selection,
we ended up using 66,099 images from the COCO dataset,
and therefore a total of 66,099 + 11,685 = 77,784 training
images were used in the second experiment. The same reduced
validation set was used in this second experiment as
well. In this case, we first fine-tuned the plain FCN-32s
network (without the CRF-RNN part) on COCO data, then
we built an FCN-8s network with the learnt weights, and
finally we trained the CRF-RNN network end-to-end using VOC
2012 training data only. Since the MS COCO ground truth
segmentation data contains somewhat coarse segmentation
masks where objects are not delineated properly, we found
that fine-tuning our model with COCO did not yield
significant improvements. This can be understood because the
primary advantage of our model comes from delineating
the objects and improving fine segmentation boundaries.
The VOC 2012 training dataset therefore helps our model
learn this task effectively. The results of this experiment are
shown in Table 2, below the bar, and we see that our
approach sets a new state of the art on the VOC 2012 dataset.
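The image selection rule and the resulting training-set size can be sketched as follows. The mask representation (a list of rows of integer label ids), the label ids themselves, and the helper names `count_voc_pixels` and `keep_image` are hypothetical; only the 200-pixel threshold and the image counts come from the text.

```python
from collections import Counter

MIN_VOC_PIXELS = 200  # threshold stated in the text


def count_voc_pixels(mask, voc_ids):
    """Count pixels in a 2-D label mask whose label is in voc_ids."""
    counts = Counter(label for row in mask for label in row)
    return sum(n for label, n in counts.items() if label in voc_ids)


def keep_image(mask, voc_ids, min_pixels=MIN_VOC_PIXELS):
    """True if the mask has at least min_pixels pixels of VOC classes."""
    return count_voc_pixels(mask, voc_ids) >= min_pixels


# Hypothetical label ids: 1 and 2 are classes shared with VOC,
# 9 is a class that exists only in COCO.
voc_ids = {1, 2}
big = [[1] * 20 for _ in range(10)]               # 200 VOC-class pixels
small = [[1] * 10 + [9] * 10 for _ in range(10)]  # only 100 VOC-class pixels

print(keep_image(big, voc_ids), keep_image(small, voc_ids))

# Training-set size in the second experiment: COCO images plus the
# 11,685 images of the first experiment.
total_images = 66_099 + 11_685
print(total_images)
```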

Note that in both setups, our approach outperforms competing
methods due to the end-to-end training of the CNN
and CRF in the unified CRF-RNN framework. We also
evaluated our models on the VOC 2010 and VOC 2011 test
sets (see Table 2). In all cases our method achieves
state-of-the-art performance.

In order to obtain qualitative evidence about how CRF-RNN
learns, we visualize the compatibility function learned
after the training stage of the CRF-RNN as a matrix
representation in Fig. 5. Element (i, j) of this matrix
corresponds to μ(i, j) defined earlier: a high value at (i, j)
implies a high penalty for assigning label i to a pixel when a
similar pixel (spatially or appearance-wise) is assigned label j.
For example, we can see that the learned compatibility matrix
assigns a low penalty to pairs of labels that tend to appear
together, such as [Motorbike, Person] and [Dining table,
Chair].
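The way such a matrix turns evidence from similar pixels into per-label penalties can be sketched as below. The matrix entries, the three-label subset, and the function name `compatibility_transform` are hypothetical and chosen only for illustration; they are not the learnt values shown in Fig. 5.

```python
def compatibility_transform(mu, q):
    """Penalty per label: out[l] = sum over l2 of mu[l][l2] * q[l2]."""
    return [sum(m * p for m, p in zip(row, q)) for row in mu]


labels = ["motorbike", "person", "sofa"]  # illustrative subset of classes

# Hypothetical penalties. The matrix is deliberately asymmetric:
# mu[0][1] differs from mu[1][0].
mu = [
    [0.0, 0.2, 1.0],
    [0.1, 0.0, 0.8],
    [1.0, 0.9, 0.0],
]

# Aggregated label distribution of similar pixels: mostly "person".
q_neighbours = [0.1, 0.8, 0.1]

penalty = compatibility_transform(mu, q_neighbours)
for name, value in zip(labels, penalty):
    print(f"{name}: {value:.2f}")
```

With these toy values, a pixel surrounded by "person" evidence is penalized far less for taking the label "motorbike" than for taking the label "sofa", which mirrors the co-occurrence effect described above.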


Method                            Without COCO   With COCO
Plain FCN-8s                              61.3        68.3
FCN-8s and CRF disconnected               63.7        69.5
End-to-end training of CRF-RNN            69.6        72.9

Table 1. Mean IU accuracy of our approach, CRF-RNN, compared
with similar methods, evaluated on the reduced VOC 2012
validation set.


Method                        VOC 2010 test  VOC 2011 test  VOC 2012 test
BerkeleyRC [3]                          n/a           39.1            n/a
O2PCPMC [8]                            49.6           48.8           47.8
Divmbest [44]                           n/a            n/a           48.1
NUS-UDS [16]                            n/a            n/a           50.0
SDS [23]                                n/a            n/a           51.6
MSRA-CFM [13]                           n/a            n/a           61.8
FCN-8s [37]                             n/a           62.7           62.2
Hypercolumn [24]                        n/a            n/a           62.6
Zoomout [38]                           64.4           64.1           64.4
Context-Deep-CNN-CRF [35]               n/a            n/a           70.7
DeepLab-MSc [10]                        n/a            n/a           71.6
Our method w/o COCO                    73.6           72.4           72.0
-------------------------------------------------------------------------
BoxSup [12]                             n/a            n/a           71.0
DeepLab [10, 41]                        n/a            n/a           72.7
Our method with COCO                   75.7           75.0           74.7

Table 2. Mean IU accuracy of our approach, CRF-RNN, compared
to the other approaches on the Pascal VOC 2010-2012 test
datasets. Methods from the first group do not use MS COCO data
for training. The methods from the second group use both COCO
and VOC datasets for training.


Pascal Context Dataset 

We conducted an experiment on the Pascal Context dataset
[39], which differs from the previous dataset in its larger
number of classes, 59. We used the provided partitions of
training and validation sets, and the obtained results
are reported in Table 3.


Method     O2P [8]   CFM [13]   FCN-8s [37]   CRF-RNN
Mean IU       18.1       34.4         37.78     39.28

Table 3. Mean IU accuracy of our approach, CRF-RNN, evaluated
on the Pascal Context validation set.

Figure 5. Visualization of the learnt label compatibility
matrix. In the standard Potts model, diagonal entries are equal
to -1, while off-diagonal entries are zero. These values have
changed after the end-to-end training of our network. Best
viewed in colour.


7.1. Effect of Design Choices 

We performed a number of additional experiments on the 
Pascal VOC 2012 validation set described above to study 
the effect of some design choices we made. 

We first studied the performance gains attained by our
modifications to the CRF over the CRF approach proposed
by [29]. We found that using different filter weights for
different classes improved the performance by 1.8 percentage
points, and that introducing the asymmetric compatibility
transform further boosted the performance by 0.9 percentage
points.

Regarding the RNN iteration count T, increasing it from
T = 5 during training to T = 10 at test time produced an
accuracy improvement of 0.2 percentage points. Setting
T = 10 during training as well reduced the accuracy by 0.7
percentage points. We believe that this might be due to a
vanishing gradient effect caused by using too many iterations.
In practice, this leads to the first part of the network (the
one producing unary potentials) receiving a very weak error
gradient signal during training, thus hampering its learning
capacity.

End-to-end training after the initialization of CRF
parameters improved performance by 3.4 percentage points.
We also conducted an experiment where we froze the FCN-8s
part and fine-tuned only the RNN part (i.e., the CRF
parameters). This improved the performance over initialization
by only 1 percentage point. We therefore conclude that
end-to-end training contributed significantly to boosting the
accuracy of the system.

Treating each iteration of mean-field inference as an
independent step with its own parameters, and training
end-to-end with 5 such iterations, yielded a final mean IU score
of only 70.9, supporting the hypothesis that the recurrent
structure of our approach is important for its success.

8. Conclusion 

We presented CRF-RNN, an interpretation of dense
CRFs as Recurrent Neural Networks. Our formulation
fully integrates CRF-based probabilistic graphical modelling
with emerging deep learning techniques. In particular,
the proposed CRF-RNN can be plugged in as a part
of a traditional deep neural network: it is capable of passing
on error differentials from its outputs to its inputs during
back-propagation-based training of the deep network
while learning CRF parameters. We demonstrate the use
of this approach by utilizing it for the semantic segmentation
task: we form an end-to-end trainable deep network
by combining a fully convolutional neural network with the
CRF-RNN. Our system achieves a new state of the art on
the popular Pascal VOC segmentation benchmark. This
improvement can be attributed to uniting the strengths
of CNNs and CRFs in a single deep network.

In the future, we plan to investigate the
advantages/disadvantages of restricting the capabilities of the
RNN part of our network to mean-field inference of dense
CRF. A sensible baseline to the work presented here would
be to use more standard RNNs (e.g. LSTMs) that learn to
iteratively improve the input unary potentials to make them
closer to the ground-truth.

Method                             mean   aero   bike   bird   boat bottle    bus    car    cat  chair
Methods trained with COCO
Our method                         74.7   90.4   55.3   88.7   68.4   69.8   88.3   82.4   85.1   32.6
DeepLab [10, 41]                   72.7   89.1   38.3   88.1   63.3   69.7   87.1   83.1   85.0   29.3
BoxSup [12]                        71.0   86.4   35.5   79.7   65.2   65.2   84.3   78.5   83.7   30.5
Methods trained w/o COCO
Our method trained w/o COCO        72.0   87.5   39.0   79.7   64.2   68.3   87.6   80.8   84.4   30.4
DeepLab-MSc-CRF-LargeFOV [10]      71.6   84.4   54.5   81.5   63.6   65.9   85.1   79.1   83.4   30.7
Context_Deep_CNN_CRF [35]          70.7   87.5   37.7   75.8   57.4   72.3   88.4   82.6   80.0   33.4
TTI_zoomout_16 [38]                64.4   81.9   35.1      ?   57.4   56.5   80.5   74.0   79.8   22.4
Hypercolumn [24]                   62.6   68.7   33.5   69.8   51.3   70.2   81.1   71.9   74.9   23.9
FCN-8s [37]                        62.2   76.8   34.2      ?   49.4   60.3   75.3   74.7   77.6   21.4
MSRA_CFM [13]                      61.8   75.7   26.7   69.5   48.8   65.6   81.0   69.2      ?   30.0
SDS [23]                           51.6   63.3   25.7   63.0   39.8   59.2   70.9   61.4   54.9   16.8
NUS_UDS [16]                       50.0   67.0      ?   41.2   45.0   47.9   65.3   60.6   58.5   15.5
TTIC-divmbest-rerank [44]          48.1   62.7   25.6   46.9   43.0   54.8   58.4   58.6   55.6   14.6
BONN_O2PCPMC_FGT_SEGM [8]          47.8   64.0   27.3   54.1   39.2   48.7   56.6   57.7   52.5   14.2

Method                              cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv
Methods trained w/o COCO
Our method trained w/o COCO           ?      ?   80.5   77.8   83.1   80.6   59.5   82.8   47.8   78.3   67.1
DeepLab-MSc-CRF-LargeFOV [10]      74.1   59.8   79.0   76.1   83.2   80.8   59.7   82.2   50.4   73.1   63.7
Context_Deep_CNN_CRF [35]          71.5   55.0   79.3   78.4   81.3   82.7   56.1   79.8   48.6   77.1   66.3
TTI_zoomout_16 [38]                69.6   53.7   74.0   76.0   76.6   68.8   44.3   70.2   40.2   68.9   55.3
Hypercolumn [24]                   60.6   46.9      ?   68.3   74.5   72.9   52.6   64.4   45.4   64.9   57.4
FCN-8s [37]                        62.5   46.8   71.8   63.9   76.5   73.9   45.2   72.4   37.4   70.9   55.1
MSRA_CFM [13]                      68.7   51.5   69.1   68.1   71.7   67.5   50.4   66.5   44.4   58.9   53.5
SDS [23]                           45.0   48.2   50.5   51.0   57.7   63.3   31.8   58.7   31.2   55.7   48.5
NUS_UDS [16]                       50.8   37.4   45.8   59.9   62.0   52.7   40.8   48.2   36.8   53.1   45.6
TTIC-divmbest-rerank [44]          47.5   31.2   44.7   51.0   60.9   53.5   36.6   50.9   30.1   50.2   46.8
BONN_O2PCPMC_FGT_SEGM [8]          54.8   29.6   42.2   58.0   54.8   50.2   36.6   58.6   31.6   48.4   38.6

Table 4. Intersection over Union (IU) accuracy of our approach, CRF-RNN, compared to the other state-of-the-art approaches on the Pascal
VOC 2012 test set. Scores for other methods were taken from the results published by the original authors. The symbols are from Chatfield et
al. [9].

Figure 6. Typical good quality segmentation results I. Illustration of sample results on the validation set of the Pascal VOC 2012 dataset.
Note that in some cases our method is able to pick correct segmentations that are not marked correctly in the ground truth. Best viewed in
colour.

Figure 7. Typical good quality segmentation results II. Illustration of sample results on the validation set of the Pascal VOC 2012 dataset.
Note that in some cases our method is able to pick correct segmentations that are not marked correctly in the ground truth. Best viewed in
colour.

Figure 8. Failure cases I. Illustration of sample failure cases on the validation set of the Pascal VOC 2012 dataset. Best viewed in colour.

Figure 9. Failure cases II. Illustration of sample failure cases on the validation set of the Pascal VOC 2012 dataset. Best viewed in colour.

Figure 10. Qualitative comparison with the other approaches. Sample results with our method on the validation set of the Pascal VOC
2012 dataset, compared with previous state-of-the-art methods. Segmentation results with the DeepLab approach were reproduced from the
original publication. Best viewed in colour.